Geographical analysis of media flows

A multidimensional approach

Claude Grasland (Université de Paris (Diderot), FR 2007 CIST, UMR 8504 Géographie-cités)


Introduction

1 DATA COLLECTION

1.1 Importation of RSS

1.1.1 The Mediacloud database

(tbd : presentation of the MediaCloud project)

Media Cloud can be freely used by researchers. All you have to do is create an account at the following address:

https://explorer.mediacloud.org

There are different ways to get news titles. We will focus here on a simple example of data obtained through the Media Cloud interface. We suppose that you want to extract news from Tunisian newspapers speaking about Europe.

1.1.2 Selection of media with source manager

We use the application called Source Manager and run a search by collection, which is the most convenient way to explore what is available in a country. In our example, the target country is Tunisia and three collections are proposed:

We have selected the collection named “Tunisia National” because we are interested in the most important newspapers of the country.

The bubble graphic on the right immediately indicates the media that have produced the highest number of news, but it is wise to explore in more detail the list on the left, which indicates for each media the starting date of data collection.

When a media appears interesting, we click on its name to obtain a brief summary of the metadata. For example, in the case of L’économiste Maghrebin the metadata indicates:

The media looks promising, but before going further, it is better to have a look at the website of the media to get a more concrete idea of its content, if we do not know in advance what it is about, what its ideological orientation is, etc.

Here we can see that this is an economic journal, published in French, with news organized in concentric geographic circles (Nation > Maghreb > Africa > World), which is precisely what we are looking for in the IMAGEUN project. We will complete this information later, but before doing so we have to check in more detail whether the production of the media is regular through time, with another tool offered by Media Cloud, the Explorer.

1.1.3 Checking the stability through time

We click on Search in Explorer on the metadata page of the Source Manager and obtain a new interface where we modify the dates to cover the full period of collection of the media (or our period of interest). In the search field, we keep the search term *, which requests all news.

Below your request, you obtain a graphic entitled Attention Over Time with the distribution of the number of news published per day, which helps you to verify whether the distribution of news is regular through time. You just have to change the type of graphic to visualize Story Count, and you can choose the time span you want (day, week or month) for the evaluation of the regularity of the news flow. In our example, we notice at the daily level some brief periods of interruption in 2019, but the flow is reasonably regular, with approximately 5 news per day at the beginning and 10 to 20 in the final period. We also notice a classical weekly cycle, with a decrease of news published during the weekend.
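Once the story list has been downloaded (see below), the same regularity check can be reproduced locally. A minimal sketch in base R, using synthetic dates as a stand-in for the real publish_date column:

```r
# Synthetic publication dates standing in for df$publish_date (illustration only)
dates <- as.Date("2019-01-01") + 0:27

# Count stories per week (cut.Date starts weeks on Monday by default)
weekly <- table(cut(dates, breaks = "week"))
weekly
```

With real data, as.Date(df$publish_date) gives the daily dates, and breaks = "day" or breaks = "month" reproduces the other time spans offered by the Explorer.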

Going down, you will find a new panel entitled Total Attention which gives you the total number of stories found. In our example, we have a total of 13626 stories produced by our media over the period.

1.1.4 Download and storage of news

Depending on your selection (all news or a specific topic) you will download more or fewer titles. Here, we make the choice to get all news, which means that we repeat the original request with *.

Finally, by clicking on the button Download all story URLs, you can get a .csv file that you can easily load in your favorite programming language, as we will see in the next section.

1.2 Corpus creation

knitr::opts_chunk$set(cache = TRUE,
                        echo = TRUE,
                        comment = "")

In the previous section (ref…) we have obtained a .csv file of news collected from Media Cloud. We will now try to put these data in a standard form, and we have chosen the format of the quanteda package as reference for data organization and storage.

But of course the researchers involved in the project may prefer to use other R packages like tm or tidytext. And they may also prefer to use another programming language like Python. That is the reason why we explain how to transform and export the data that has been prepared and harmonized with quanteda into various formats like .csv or JSON.

We detail here an example of importation with the newspaper “L’économiste maghrebin”.

1.2.1 Importation of text to R

This step is not always obvious because many encoding problems can appear that are more or less easy to solve. In principle, the data from Media Cloud are exported in standard UTF-8, but as we will see it is not necessarily the case.
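A quick way to locate problematic records before parsing is base R's validUTF8(), which flags strings whose bytes are not valid UTF-8. A sketch (find_bad_utf8 is a hypothetical helper, not part of any package):

```r
# Return the indices of elements that are not valid UTF-8
find_bad_utf8 <- function(x) which(!validUTF8(x))

# Example: the second string contains a lone 0xC3 lead byte, which is invalid UTF-8
x <- c("Les tarifs de l'ADSL", rawToChar(as.raw(c(0x61, 0xC3))))
find_bad_utf8(x)
```

Applied to the raw lines of the downloaded file, e.g. readLines("data/fr_TUN_ecomag.csv", warn = FALSE), this points directly at the records with invalid bytes.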

We try firstly to use the standard R function read.csv():

store <- "data"
  media <- "fr_TUN_ecomag"
  type <-".csv"
  
  fic <- paste(store,"/",media,type,sep="")
  
  df<-read.csv(fic,
               sep=",",
               header=T,
               encoding = "UTF-8",
               stringsAsFactors = F)
  kable(head(df))
stories_id publish_date title url language ap_syndicated themes media_id media_name media_url
1129295780 2019-01-02 03:42:46 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 https://www.leconomistemaghrebin.com/2019/01/02/tarifs-adsl-reduits-1-janvier-2019/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295771 2019-01-02 04:06:27 6ème Sfax Marathon International des Oliviers https://www.leconomistemaghrebin.com/2019/01/02/sfax-marathon-international-oliviers/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129295760 2019-01-02 06:05:08 Télécharger la version finale de la Loi de finances 2019 https://www.leconomistemaghrebin.com/2019/01/02/telecharger-la-version-finale-de-la-loi-de-finances-2019/ en False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129578051 2019-01-02 10:05:06 Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public https://www.leconomistemaghrebin.com/2019/01/02/chawki-tabib-245-dossiers-transferes-au-ministere-public/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461662 2019-01-02 07:52:36 Panoro Energy finalise l’acquisition de OMV Tunisia https://www.leconomistemaghrebin.com/2019/01/02/panoro-energy-finalise-lacquisition-de-omv-tunisia/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/
1129461636 2019-01-02 08:57:54 La partie syndicale maintient le boycott des examens du secondaire https://www.leconomistemaghrebin.com/2019/01/02/partie-syndicale-boycott-examens-secondaire/ fr False 623820 L’Economiste Maghrebin http://www.leconomistemaghrebin.com/

The importation was successful for 12794 news, but R produced an error message for 3 news:

Error in gregexpr(calltext, singleline, fixed = TRUE) : regular expression is invalid UTF-8

Looking in more detail, we also discover some encoding problems in news, as in the following example where the text of the news appears differently if we apply the standard function paste() or the specialized function knitr::kable() for printing.

paste(df[9, 3])
[1] "Néji Jalloul : &#8220;Nidaa Tounes peut revenir si&#8230;&#8221;"
kable((df[9,3]))
x
Néji Jalloul : “Nidaa Tounes peut revenir si…”

1.2.2 Resolution of encoding problems

It is sometimes possible to fix the encoding problems manually when they are not too numerous, as in the present example.

df$text<-df$title
  # standardize apostrophe
  df$text<-gsub("&#8217;","'",df$text)
  
  # standardize punct
  df$text<-gsub('&#8230;','.',df$text)
  
  # standardize hyphens
  df$text<-gsub('&#8211;','-',df$text)
  
  # Remove quotation marks
  df$text<-gsub('&#171;&#160;','',df$text)
  df$text<-gsub('&#160;&#187;','',df$text)
  df$text<-gsub('&#8220;','',df$text)
  df$text<-gsub('&#8221;','',df$text)
  df$text<-gsub('&#8216;','',df$text)
  df$text<-gsub('&#8243;','',df$text)

We can introduce other cleaning procedures here, or keep them for later analysis.
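As an alternative to listing each entity by hand, the numeric HTML references of the form &#NNNN; can be decoded generically in base R. A sketch (decode_entities is a hypothetical helper; note that, unlike the substitutions above, it decodes the quotation marks instead of removing them):

```r
# Decode numeric HTML character references (&#8217; etc.) into UTF-8 characters
decode_entities <- function(x) {
  m <- gregexpr("&#[0-9]+;", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(e) {
    # Convert each code point number to its character
    vapply(as.integer(sub("&#([0-9]+);", "\\1", e)), intToUtf8, character(1))
  })
  x
}

decode_entities("Néji Jalloul : &#8220;Nidaa Tounes peut revenir si&#8230;&#8221;")
```

The function is vectorized over the whole title column, so a single call handles every entity at once.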

1.2.3 Transformation in quanteda format

We propose a storage based on the quanteda format, obtained by simply transforming the data frame imported above. We keep only the name of the source and the date of publication.

# Create Quanteda corpus
  qd<-corpus(df,docid_field = "stories_id")
  
  
  # Select docvar fields and rename media
  qd$date <-as.Date(qd$publish_date)
  qd$source <-media
  docvars(qd)<-docvars(qd)[,c("source","date")]
  
  
  
  
  # Add global meta
  meta(qd,"meta_source")<-"Media Cloud "
  meta(qd,"meta_time")<-"Download the 2021-09-30"
  meta(qd,"meta_author")<-"Elaborated by Claude Grasland"
  meta(qd,"project")<-"ANR-DFG Project IMAGEUN"

We have created a quanteda object with a lot of information stored in various fields. The structure of the object is the following:

str(qd)
 'corpus' Named chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" ...
   - attr(*, "names")= chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
   - attr(*, "docvars")='data.frame': 12794 obs. of  5 variables:
    ..$ docname_: chr [1:12794] "1129295780" "1129295771" "1129295760" "1129578051" ...
    ..$ docid_  : Factor w/ 12794 levels "1129295780","1129295771",..: 1 2 3 4 5 6 7 8 9 10 ...
    ..$ segid_  : int [1:12794] 1 1 1 1 1 1 1 1 1 1 ...
    ..$ source  : chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
    ..$ date    : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
   - attr(*, "meta")=List of 3
    ..$ system:List of 6
    .. ..$ package-version:Classes 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 3 0 0
    .. ..$ r-version      :Classes 'R_system_version', 'package_version', 'numeric_version'  hidden list of 1
    .. .. ..$ : int [1:3] 4 1 0
    .. ..$ system         : Named chr [1:3] "Windows" "x86-64" "claude"
    .. .. ..- attr(*, "names")= chr [1:3] "sysname" "machine" "user"
    .. ..$ directory      : chr "C:/git/geomedia"
    .. ..$ created        : Date[1:1], format: "2021-11-25"
    .. ..$ source         : chr "data.frame"
    ..$ object:List of 2
    .. ..$ unit   : chr "documents"
    .. ..$ summary:List of 2
    .. .. ..$ hash: chr(0) 
    .. .. ..$ data: NULL
    ..$ user  :List of 4
    .. ..$ meta_source: chr "Media Cloud "
    .. ..$ meta_time  : chr "Download the 2021-09-30"
    .. ..$ meta_author: chr "Elaborated by Claude Grasland"
    .. ..$ project    : chr "ANR-DFG Project IMAGEUN"

We can look at the first titles with head()

kable(head(qd,3))
x
1129295780 Les tarifs de l’ADSL réduits à partir du 1er janvier 2019
1129295771 6ème Sfax Marathon International des Oliviers
1129295760 Télécharger la version finale de la Loi de finances 2019

We can get meta information on each story with summary()

summary(head(qd,3))
Corpus consisting of 3 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We can get meta information about the full document

meta(qd)
$meta_source
  [1] "Media Cloud "

  $meta_time
  [1] "Download the 2021-09-30"

  $meta_author
  [1] "Elaborated by Claude Grasland"

  $project
  [1] "ANR-DFG Project IMAGEUN"

1.2.4 Storage of the quanteda object

We can finally save the object in .RDS format in a directory dedicated to our quanteda files. It can be useful to include some information in the name of the file.

store <- "data"
  type<- ".RDS"
  myfile <- paste(store,"/",media,type,sep="")
  myfile
[1] "data/fr_TUN_ecomag.RDS"
saveRDS(qd,myfile)
  qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
  1129295780 :
  "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019"

  1129295771 :
  "6ème Sfax Marathon International des Oliviers"

  1129295760 :
  "Télécharger la version finale de la Loi de finances 2019"
summary(qd,3)
Corpus consisting of 12794 documents, showing 3 documents:

         Text Types Tokens Sentences        source       date
   1129295780    11     11         1 fr_TUN_ecomag 2019-01-02
   1129295771     6      6         1 fr_TUN_ecomag 2019-01-02
   1129295760     8     10         1 fr_TUN_ecomag 2019-01-02

We have kept all the information present in the initial file, but also added specific metadata of interest for us. The size of the storage is now 0.6 Mb, roughly six times smaller than the initial .csv file downloaded from Media Cloud, whose size was 3.8 Mb.

1.2.5 Back transformation to tibble

In the following steps we will make intensive use of quanteda, but sometimes it can be useful to export the results in a more practical format or to use other packages. For this reason, it is important to know that the tidytext package can easily transform quanteda objects into tibbles, which are more classical, easier to manage, and easy to export to other formats like data.frame or data.table.

td <- tidy(qd)
  kable(head(td))
text source date
Les tarifs de l’ADSL réduits à partir du 1er janvier 2019 fr_TUN_ecomag 2019-01-02
6ème Sfax Marathon International des Oliviers fr_TUN_ecomag 2019-01-02
Télécharger la version finale de la Loi de finances 2019 fr_TUN_ecomag 2019-01-02
Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public fr_TUN_ecomag 2019-01-02
Panoro Energy finalise l’acquisition de OMV Tunisia fr_TUN_ecomag 2019-01-02
La partie syndicale maintient le boycott des examens du secondaire fr_TUN_ecomag 2019-01-02
str(td)
tibble [12,794 x 3] (S3: tbl_df/tbl/data.frame)
   $ text  : chr [1:12794] "Les tarifs de l'ADSL réduits à partir du 1er janvier 2019" "6ème Sfax Marathon International des Oliviers" "Télécharger la version finale de la Loi de finances 2019" "Chawki Tabib : 245 dossiers de corruption présumée transmis au ministère public" ...
   $ source: chr [1:12794] "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" "fr_TUN_ecomag" ...
   $ date  : Date[1:12794], format: "2019-01-02" "2019-01-02" ...
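As announced above, the tibble can then be exported to formats usable outside R. A sketch, with a small stand-in for td so the chunk runs on its own; the JSON step assumes the jsonlite package is installed:

```r
# Two-row stand-in for the td tibble built above (illustration only)
td <- data.frame(text   = c("titre 1", "titre 2"),
                 source = "fr_TUN_ecomag",
                 date   = as.Date("2019-01-02"))

# Export to .csv with explicit UTF-8 encoding
f_csv <- tempfile(fileext = ".csv")
write.csv(td, f_csv, row.names = FALSE, fileEncoding = "UTF-8")

# Export to JSON if the jsonlite package is available
if (requireNamespace("jsonlite", quietly = TRUE)) {
  jsonlite::write_json(td, sub("\\.csv$", ".json", f_csv))
}
```

With the real object, the same two calls on td produce the .csv and JSON versions of the corpus for use in Python or any other environment.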

2 GEOGRAPHICAL TAGS

The objective of this section is to explore the possibilities of Wikipedia and related tools (Wikidata, Wikimedia, …) for the production of multilingual dictionaries of geographical objects like states, cities, continents, regional organizations, … In order to test the interest of this approach, we will try to produce multilingual dictionaries for the identification of different types of entities. For example:

  • Europe and its subregions
  • Africa and its subregions
  • Asia and its subregions
  • The Mediterranean
  • Middle East, Near East, Persian Gulf …

The dictionary will be established in four languages in order to check if the results are really comparable and if it is possible to elaborate cross-language analysis of media.

  • English: applied to media of the UK and Ireland
  • French: applied to media of France and Tunisia
  • German: applied to media of Germany
  • Turkish: applied to media of Turkey

2.1 Wikipedia entities

Wikidata defines itself as

  • a free and open knowledge base that can be read and edited by both humans and machines.
  • central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
  • a support to many other sites and services beyond just Wikimedia projects! The content of Wikidata is available under a free license, exported using standard formats, and can be interlinked to other open data sets on the linked data web.

2.1.1 Codification of entities

The first interest of Wikidata is to provide a unique identification code for each object. For example, a search for “Africa” will produce a list of different objects, each characterized by a unique code:

knitr::include_graphics("figures/Wikidata001.png")

2.1.2 Information on entities

Once we have selected an entity (e.g. Q15), we obtain a new page with more detailed information in English, but also in all other languages available in Wikipedia.

knitr::include_graphics("figures/Wikidata002.png")

A lot of information is available concerning the entity but, at this stage, the most important items for our research are:

  1. the translation into different languages
  2. the equivalent words or expressions in different languages
  3. the definitions in different languages
  4. the ambiguity of the term in each language and the potential risks of confusion with other entities.

Of course we should not take for granted the answers proposed by Wikidata (as noticed by Georg, Wikipedia is a matter of research for IMAGEUN …), but without any doubt it offers a very good opportunity to clarify our questions and helps us to build tools for the recognition of world regions and other geographical imaginations in a multilingual perspective.

2.1.3 Wikipedia entities as nodes of an ontology

It appears crucial to introduce here a clear distinction between Wikipedia entities and the textual units associated with the names and definitions of these entities.

A Wikipedia entity like Q15 is an element of an ontology designed by its authors for specific purposes. The specificity of the Wikidata ontology is that it is a multilingual web where Q15 is a node present in different linguistic layers. It means that we do not have a single name or a single definition of Q15, except if we adopt the neocolonial perspective of choosing the English language as reference. Depending on the context (i.e. the language or sub-language), Q15 could be defined as:

  • (fr) : A “continent” named “Afrique”
  • (en) : A “continent on the Earth’s northern and southern hemispheres” named “Africa” or “African continent”
  • (de) : A “Kontinent auf der Nord- und Südhalbkugel der Erde” named “Afrika”
  • (tr) : A “Dünya’nın kuzey ve güney yarıkürelerindeki bir kıta” named “Afrika” or “Afrika kıtası”

In other words, the existence of the same Wikipedia entity code does not offer any guarantee of concordance between the geographical objects found in news published in different languages or different countries. But, and this is the important point, it helps us to point out similarities and differences between sets of geographical entities that are more or less comparable in each language.

2.1.4 A tool for cross-linguistic experiments

Keeping in mind the limits of the equivalence of entities across languages, it can nevertheless be an interesting experiment to select a set of Wikipedia entities (Q15, Q258, Q4412 …) and to examine their relative frequency in our different media from different countries with different languages. A typical hypothesis could be something like:

  • Is Q15 more mentioned than Q46 in Tunisian newspapers?

which is not equivalent to the question

  • Is Africa more mentioned than Europe in Tunisian newspapers?

2.2 The package WikidataR

The package WikidataR is an interface to the Wikidata API in the R language. Equivalent tools are available in Python and other languages for those not familiar with R, and it is of course possible to use the API directly. The first step is to install the most recent version of the R package WikidataR, which also installs related packages of interest.

#install.packages("WikidataR")
  library(WikidataR)

(based on Etienne Toureille previous experiments)

2.2.1 Identification of entities of interest

The function find_item helps to find all Wikipedia entities (= items) associated with a textual unit (a word or group of words) in a given language. Let us start with the search for entities associated with “Afrique” in the French language:

mytext <- "Afrique"
  
  items <- find_item(search_term = mytext,
                     language = "fr",
                     limit=30)
  class(items)
[1] "find_item"
length(items)
[1] 30

The resulting object is of class find_item, which is in practice a list describing the entities that have been recognized as associated with the textual unit we have chosen. In the French case, we have found 30 entities that match our textual unit. Let us have a look at the first one:

items[[1]]
$id
  [1] "Q15"

  $title
  [1] "Q15"

  $pageid
  [1] 111

  $repository
  [1] "wikidata"

  $url
  [1] "//www.wikidata.org/wiki/Q15"

  $concepturi
  [1] "http://www.wikidata.org/entity/Q15"

  $label
  [1] "Africa"

  $description
  [1] "continent on the Earth's northern and southern hemispheres"

  $match
  $match$type
  [1] "label"

  $match$language
  [1] "fr"

  $match$text
  [1] "Afrique"


  $aliases
  $aliases[[1]]
  [1] "Afrique"

As we can see, we can easily identify the code, the label and the description in English, but also the text responsible for the match in French. We can therefore create a function item_info that extracts all elements of interest and puts them in a table in order to have a complete view.

item_info <- function(my_item){ 
    
  
      if (is.null(my_item$id) == F){item_id = my_item$id}
          else {item_id  = NA}
    
      if (is.null(my_item$label) ==F){item_label = my_item$label}
          else {item_label  = NA}
    
      if (is.null(my_item$desc) == F) {item_desc= my_item$desc}
          else {item_desc  = NA}
    
      if (is.null(my_item$match$lang) ==F){item_lang = my_item$match$lang}
          else {item_lang  = NA}
    
      if (is.null(my_item$match$text) ==F){item_text = my_item$match$text}
          else {item_text  = NA}
  
    
    res<-data.frame(item_id,item_label,item_desc,item_lang,item_text)
    
    return(res) 
    }

For example :

item_info(items[[1]])
  item_id item_label                                                  item_desc
  1     Q15     Africa continent on the Earth's northern and southern hemispheres
    item_lang item_text
  1        fr   Afrique

We then build a second function that extracts all the Wikipedia entities associated with a textual unit for a given language.

extract_entities <- function(mytext= "Afrique",
                               mylang = "fr",
                               maxres = 20) {
    # Extract items
    items <- find_item(search_term =  mytext,
                     language      =  mylang,
                     limit         =  maxres)
    
    # Create empty dataset
    res<-data.frame()
    res$item_id    <- as.character()
    res$item_label <- as.character()
    res$item_desc <- as.character()
    res$item_lang  <- as.character()
    res$item_text  <- as.character()
    
    # Fill dataset
    k<-length(items)
        for (i in 1:k) {
             res <- rbind(res,item_info(items[[i]]))
        }
    
    # Return dataset
    return(res)
  
  }

For example :

tab <- extract_entities("Afrique","fr",20)
  kable(tab)
item_id item_label item_desc item_lang item_text
Q15 Africa continent on the Earth’s northern and southern hemispheres fr Afrique
Q181238 Africa Roman province on the northern African coast covering parts of present-day Tunisia, Algeria, and Libya fr Afrique
Q203548 African Plate continental plate underlying Africa fr Afrique
Q258 South Africa sovereign state in Southern Africa fr Afrique du Sud
Q27433 Central Africa core region of the African continent fr Afrique centrale
Q4412 West Africa region of Africa fr Afrique de l’Ouest
Q132959 Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert fr Afrique subsaharienne
Q27394 Southern Africa southernmost region of the African continent fr Afrique australe
Q27407 East Africa easterly region of the African continent fr Afrique de l’Est
Q27381 North Africa northernmost region of the African continent fr Afrique du Nord
Q2826196 Afrique Wikimedia disambiguation page fr Afrique
Q23639892 Africa artwork by Eugène Delaplanche in Paris, France fr Afrique
Q66022909 Afrique NA fr Afrique
Q153963 German East Africa former German posesssion in the African Great Lakes region between 1884–1919 fr Afrique orientale allemande
Q4690138 Afrique album by Count Basie fr Afrique
Q65574303 Afrique NA fr Afrique
Q56317928 Afrique NA fr Afrique
Q210682 French West Africa French colonial federation (1895–1958) fr Afrique-Occidentale française
Q106179043 Afrique NA en Afrique
Q271894 French Equatorial Africa federation of French colonial possessions in Central Africa fr Afrique-Équatoriale française

As we can see, many of the entities proposed in the list are not interesting, and we will probably have to select the entities of interest one by one. But we clearly have to keep two different lists of entities:

  • the target entities: those that we consider as potential world regions or candidates for the title of “geographic imagination”;
  • the control entities: those that we have to identify or eliminate if we want to identify our target entities correctly, like the country of South Africa.

In the case of Africa, we could for example establish a more limited list

entit <- c("Q15", "Q4412","Q132959", "Q27394","Q27407","Q27381","Q27433","Q258")
  
  tab<-tab %>% filter(item_id %in% entit)
  kable(tab)
item_id item_label item_desc item_lang item_text
Q15 Africa continent on the Earth’s northern and southern hemispheres fr Afrique
Q258 South Africa sovereign state in Southern Africa fr Afrique du Sud
Q27433 Central Africa core region of the African continent fr Afrique centrale
Q4412 West Africa region of Africa fr Afrique de l’Ouest
Q132959 Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert fr Afrique subsaharienne
Q27394 Southern Africa southernmost region of the African continent fr Afrique australe
Q27407 East Africa easterly region of the African continent fr Afrique de l’Est
Q27381 North Africa northernmost region of the African continent fr Afrique du Nord

But this list, which was based on the French textual units associated with “Afrique”, should certainly be completed by equivalent lists established for other languages with different seeds (“Africa” in English, “Afrika” in German, …).
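Once extract_entities has been run with one seed per language, the resulting tables can be stacked and deduplicated on the entity code. A sketch with illustrative stand-in rows (combine_entities is a hypothetical helper, and the sample tables stand in for real query results):

```r
# Keep each Wikidata entity only once across per-language result tables
combine_entities <- function(tabs) {
  all <- do.call(rbind, tabs)
  all[!duplicated(all$item_id), ]
}

# Illustrative stand-ins for extract_entities("Afrique", "fr", 20), etc.
tab_fr <- data.frame(item_id   = c("Q15", "Q258"),
                     item_lang = "fr",
                     item_text = c("Afrique", "Afrique du Sud"))
tab_en <- data.frame(item_id   = c("Q15", "Q27381"),
                     item_lang = "en",
                     item_text = c("Africa", "North Africa"))

tab_all <- combine_entities(list(tab_fr, tab_en))
```

Here Q15 appears in both seed lists but is kept only once, so the combined table can then be filtered into target and control entities as above.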

2.2.2 Elaboration of a cross-linguistic dictionary

Assuming that we have established a list of Wikipedia entities of interest, we can now turn to the creation of a dictionary for the identification of these entities in different languages. We will use for that purpose the powerful function get_property.

item_prop <- get_property("Q15")[[1]]

The result is a very large object (a list of lists) which provides all the information (or links toward this information) in all languages where the object is available. The problem is therefore to understand the structure of this object and to extract exactly what we need. In our case, we want to extract, for each language of interest, the labels, descriptions and aliases of each entity.

The information will be separated into two datasets:

  • a dictionary of definitions
  • a dictionary of labels and aliases

We create two functions, one dedicated to each of these tasks.

extract_def <- function(item = c("Q15", "Q246"),
                          langs = c("fr","de","en","tr")) {
    # Create empty dataset
    res<-data.frame()
    res$id    <- as.character()
    res$lang  <- as.character()
    res$label <- as.character()
    res$desc  <- as.character()
    
    
    # Loop of items
    n <- length(item)
    for (i in 1:n) {
      
       # Extract item properties
      item_prop <- get_property(item[i])[[1]]
    
     
       # Loop  of language
       p<-length(langs)
       for (j in 1:p) {
          id <- item[i]
          lang  <- langs[j]
          if(is.null(item_prop[["labels"]][[lang]]$value)==F) {label <- item_prop[["labels"]][[lang]]$value}
             else { label <- NA}
          if(is.null(item_prop[["descriptions"]][[lang]]$value)==F) {desc <- item_prop[["descriptions"]][[lang]]$value}
             else { desc <- NA}
          add <-data.frame(id,lang,label,desc)
          res<- rbind(res,add) 
          }
    
    }
    # Export result
  return(res)
  
  }

The function works properly as long as the entities are available in all languages. It should be adapted to prevent errors when an entity is not available in one language.
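One way to adapt it is to wrap the lookup in tryCatch() so that a failing request yields NULL instead of stopping the whole loop. A minimal sketch (safe_call is a hypothetical helper; inside extract_def it would wrap the get_property(item[i]) call):

```r
# Return the result of f(...), or NULL if the call throws an error
safe_call <- function(f, ...) {
  tryCatch(f(...), error = function(e) NULL)
}

safe_call(sum, 1:3)    # a normal call goes through unchanged
safe_call(stop, "no such entity")    # the error is absorbed, NULL is returned
```

The loop body then only needs to test is.null() on the result and fill the row with NA values when the entity could not be retrieved.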

entit <- c("Q15", "Q4412","Q132959", "Q27394","Q27407","Q27381","Q27433","Q258")
  
  
  tab<-extract_def(entit,c("fr","de","tr","en"))
  kable(tab)
id lang label desc
Q15 fr Afrique continent
Q15 de Afrika Kontinent auf der Nord- und Südhalbkugel der Erde
Q15 tr Afrika Dünya’nin kuzey ve güney yarikürelerindeki bir kita
Q15 en Africa continent on the Earth’s northern and southern hemispheres
Q4412 fr Afrique de l’Ouest région d’Afrique
Q4412 de Westafrika Kontinentalteil
Q4412 tr Bati Afrika Afrika’nin batisindaki 16 ülkenin bulundugu alan
Q4412 en West Africa region of Africa
Q132959 fr Afrique subsaharienne partie du continent africain au sud du Sahara
Q132959 de Subsahara-Afrika südlich der Sahara gelegener Teil Afrikas
Q132959 tr Sahraalti Afrika NA
Q132959 en Sub-Saharan Africa area of the continent of Africa that lies south of the Sahara Desert
Q27394 fr Afrique australe région la plus méridionale du continent africain
Q27394 de Südliches Afrika Region in Afrika
Q27394 tr Güney Afrika NA
Q27394 en Southern Africa southernmost region of the African continent
Q27407 fr Afrique de l’Est région d’Afrique
Q27407 de Ostafrika Region in Afrika
Q27407 tr Dogu Afrika NA
Q27407 en East Africa easterly region of the African continent
Q27381 fr Afrique du Nord région en Afrique
Q27381 de Nordafrika Region in Afrika
Q27381 tr Kuzey Afrika Afrika kitasinin Fas, Cezayir, Tunus, Libya, Misir ve Sudan’i içeren kuzey bölgesi
Q27381 en North Africa northernmost region of the African continent
Q27433 fr Afrique centrale Région d’Afrique
Q27433 de Zentralafrika Region in Afrika
Q27433 tr Orta Afrika Afrika kitasinin Burundi, Orta Afrika Cumhuriyeti, Çad, Kongo Demokratik Cumhuriyeti ve Ruanda’yi barindiran orta kismi
Q27433 en Central Africa core region of the African continent
Q258 fr Afrique du Sud pays d’Afrique
Q258 de Südafrika Staat im südlichen Afrika
Q258 tr Güney Afrika Cumhuriyeti Güney Afrika’da bulunan bir ülke
Q258 en South Africa sovereign state in Southern Africa

2.2.3 Extraction of aliases

Now we have to extract the aliases, which are alternative texts corresponding to the same entity in a given language. For example, Q27394, which corresponds to the southern part of Africa (a subregion, not a country), is associated in the Spanish language with one main label and three equivalent aliases:

item_prop <- get_property("Q27394")[[1]]
  item_prop$labels$es$value
[1] "África austral"
item_prop$aliases$es
  language             value
  1       es África meridional
  2       es    África del Sur
  3       es     sur de África

But in the French language, no aliases are mentioned:

item_prop$labels$fr$value
[1] "Afrique australe"
item_prop$aliases$fr
NULL

The fact that no aliases are mentioned in French can be considered inconsistent compared with Spanish, and we could certainly imagine adding in French the translations of two Spanish aliases: “Afrique méridionale”, “Sud de l’Afrique”. But we cannot add “Afrique du Sud” because in French it refers to the state and not to the subregion.

Despite the fact that they are not complete, the aliases are certainly a good solution when we want to obtain more efficient dictionaries. For example, if we want to capture the state of South Africa (Q258), we can complete the official label with 4 aliases in French and 3 aliases in Spanish, taking into account the fact that the text can be in upper or lower case, with or without accents, …

item_prop <- get_property("Q258")[[1]]
  item_prop$labels$es$value
[1] "Sudáfrica"
item_prop$aliases$es
  language                  value
  1       es República de Sudáfrica
  2       es              Sudafrica
  3       es Republica de Sudafrica
item_prop$labels$fr$value
[1] "Afrique du Sud"
item_prop$aliases$fr
  language                       value
  1       fr    République sud-africaine
  2       fr République d’Afrique du Sud
  3       fr    république sud-africaine
  4       fr république d’Afrique du Sud
lang <- "fr"
  is.null(item_prop[["aliases"]][[lang]])
[1] FALSE
ali <- item_prop[["aliases"]][[lang]]$value
  n<-length(ali)
  for (i in 1:n) { print(ali[i])}
[1] "République sud-africaine"
  [1] "République d’Afrique du Sud"
  [1] "république sud-africaine"
  [1] "république d’Afrique du Sud"
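
Since the aliases anticipate variations in case and accents, it can help to normalise both news titles and labels before matching them. A minimal sketch in base R (the normalise() helper is our own, and iconv transliteration may vary slightly across platforms):

```r
# Normalise a text: lower case, accents transliterated to plain ASCII,
# so that "República de Sudáfrica" and "republica de sudafrica" compare equal
normalise <- function(x) {
  tolower(iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT"))
}

aliases <- c("República de Sudáfrica", "Sudafrica", "Republica de Sudafrica")

# A title variant in upper case and without some accents is still recognised
normalise("REPUBLICA de Sudáfrica") %in% normalise(aliases)
```

With such a normalisation, several of the aliases stored in Wikidata collapse into a single form, which shortens the dictionary.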

We therefore propose a function called extract_alias which returns, for each entity of interest, the list of labels and aliases in each language. We do not store the description, which has already been obtained with the extract_def function:

extract_alias <- function(items = c("Q15", "Q258"),
                          langs = c("fr", "de", "en", "tr")) {
  # Create empty dataset
  res <- data.frame(id = character(), lang = character(), label = character())

  # Loop over items
  for (i in seq_along(items)) {

    # Extract item properties
    item_prop <- get_property(items[i])[[1]]

    # Loop over languages
    for (j in seq_along(langs)) {
      id   <- items[i]
      lang <- langs[j]

      # Main label (NA if not available in this language)
      if (!is.null(item_prop[["labels"]][[lang]]$value)) {
        label <- item_prop[["labels"]][[lang]]$value
      } else {
        label <- NA
      }
      res <- rbind(res, data.frame(id, lang, label))

      # Aliases (if any exist in this language)
      if (!is.null(item_prop[["aliases"]][[lang]])) {
        ali <- item_prop[["aliases"]][[lang]]$value
        for (k in seq_along(ali)) {
          res <- rbind(res, data.frame(id, lang, label = ali[k]))
        }
      }
    }
  }

  # Export result
  return(res)
}

Let’s try the function on the continent “Africa” (Q15), the subregion “Southern Africa” (Q27394) and the state “South Africa” (Q258) in four languages:

tab<- extract_alias(items = c("Q15", "Q27394", "Q258"),
                langs = c("fr","de","en","tr"))
  kable(tab)
id lang label
Q15 fr Afrique
Q15 de Afrika
Q15 en Africa
Q15 en African continent
Q15 en Ancient Libya
Q15 tr Afrika
Q15 tr Afrika kıtası
Q27394 fr Afrique australe
Q27394 de Südliches Afrika
Q27394 de Südafrika
Q27394 en Southern Africa
Q27394 tr Güney Afrika
Q258 fr Afrique du Sud
Q258 fr République sud-africaine
Q258 fr République d’Afrique du Sud
Q258 fr république sud-africaine
Q258 fr république d’Afrique du Sud
Q258 de Südafrika
Q258 de Suedafrika
Q258 de Republik Südafrika
Q258 en South Africa
Q258 en Republic of South Africa
Q258 en RSA
Q258 en SA
Q258 en za
Q258 en 🇿🇦
Q258 en zaf
Q258 tr Güney Afrika Cumhuriyeti

The function works!
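
A table of this shape is easy to turn into a recognition dictionary, for instance one regular expression per entity and language. The sketch below is our own illustration on a hand-made extract (it assumes the labels contain no regex metacharacters):

```r
# A small extract mimicking the output of extract_alias()
tab <- data.frame(
  id    = rep("Q258", 3),
  lang  = rep("fr", 3),
  label = c("Afrique du Sud", "République sud-africaine",
            "République d'Afrique du Sud")
)

# One regular expression per (id, lang): the alternation of all its labels
pat <- aggregate(label ~ id + lang, data = tab,
                 FUN = function(x) paste0("(", paste(x, collapse = "|"), ")"))

# The pattern recognises any of the three forms in a news title
grepl(pat$label[1], "Visite officielle en Afrique du Sud")
```

For a real corpus, labels should first be escaped (or matched with fixed = TRUE term by term) so that characters such as parentheses do not break the pattern.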

2.2.4 Conclusion

It is now possible to develop a global research strategy for the analysis of world regions:

1. Define a set of target regions in one language: in our example, it was based on the use of the term “Afrique” in French, but we can imagine a different list.

2. Identify the codes of the Wikidata entities associated with these target regions: there are generally many entities of minor interest.

3. Identify the codes of the other Wikidata entities that should be added for control: as we have seen, some entities are likely to create confusion and ambiguity in the definition of the target entities. These entities will be transformed into compounds or eliminated from the text before looking for the target entities.

4. Extract the properties of the entities in the different languages of interest: this step can be an opportunity to return to step 1, for example if it appears that some subdivisions of Africa are available in English or German but not in French.

5. Compare the definitions of the Wikidata entities in different languages: it is important to check whether the assumption that the entities are identical holds. If not, some entities will be eliminated from the list.

6. Extract the dictionary for the recognition of entities, which can be done in a multilingual perspective.

It is obviously possible to apply the same procedure to other objects such as states, capital cities, organizations, people, etc.
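
Step 3 of this strategy can be illustrated with a tiny sketch: before searching for the target term “Afrique”, the ambiguous entity “Afrique du Sud” is transformed into a compound so that it no longer produces a false match (the underscore convention and the gsub() call are just one possible implementation):

```r
# A title where "Afrique du Sud" (the state) must not count
# as an occurrence of the target region "Afrique"
title <- "Le président d'Afrique du Sud visite l'Afrique de l'Ouest"

# Transform the ambiguous entity into a compound...
clean <- gsub("Afrique du Sud", "Afrique_du_Sud", title, fixed = TRUE)

# ...so that the target term is now matched only once
lengths(regmatches(clean, gregexpr("Afrique\\b", clean, perl = TRUE)))
```

The word boundary \b does not fire before the underscore, so the compound is invisible to the search for the target term.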

Bibliography

BARNIER, Julien, 2021. rmdformats: HTML Output Formats and Templates for ’rmarkdown’ Documents [online]. S.l.: s.n. Available at: https://github.com/juba/rmdformats.
R CORE TEAM, 2020. R: A Language and Environment for Statistical Computing [online]. Vienna, Austria: R Foundation for Statistical Computing. Available at: https://www.R-project.org/.
XIE, Yihui, 2020. knitr: A General-Purpose Package for Dynamic Report Generation in R [online]. S.l.: s.n. Available at: https://CRAN.R-project.org/package=knitr.

Annexes

Session info

setting value
version R version 4.1.0 (2021-05-18)
os Windows 10 x64
system x86_64, mingw32
ui RTerm
language (EN)
collate French_France.1252
ctype French_France.1252
tz Europe/Paris
date 2021-11-25
package ondiskversion source
dplyr 1.0.6 CRAN (R 4.1.0)
ggplot2 3.3.3 CRAN (R 4.1.0)
knitr 1.34 CRAN (R 4.1.1)
quanteda 3.0.0 CRAN (R 4.1.0)
readtext 0.80 CRAN (R 4.1.0)
rmarkdown 2.11 CRAN (R 4.1.1)
rzine 0.1.0 gitlab ()
tidytext 0.3.1 CRAN (R 4.1.1)
WikidataR 2.3.1 CRAN (R 4.1.1)

Citation

@Manual{ficheRzine,
    title = {Titre de la fiche},
    author = {{Auteur.e.s}},
    organization = {Rzine},
    year = {202x},
    url = {http://rzine.fr/},
  }


Glossary